    Introduction to speech processing using deep learning techniques (Introducción al procesamiento del habla mediante técnicas de deep learning)

    This work presents a theoretical study of the architecture of speech recognizers based on generative models. Specifically, two different systems are analyzed: systems based on hidden Markov models and Gaussian mixtures, and hybrid models that combine hidden Markov models and neural networks. To this end, the work begins with an introduction to the speech recognition problem. It then gives a general analysis of Gaussian mixture models, hidden Markov models and neural networks. Finally, it presents the Kaldi toolkit, which is used to run several experiments that compare and analyze the characteristics of the different speech recognition systems. In particular, we focus on studying the behaviour of the neural networks

    A Differentiable Generative Adversarial Network for Open Domain Dialogue

    Paper presented at IWSDS 2019: International Workshop on Spoken Dialogue Systems Technology, Siracusa, Italy, April 24-26, 2019. This work presents a novel methodology to train open domain neural dialogue systems within the framework of Generative Adversarial Networks with gradient-based optimization methods. We avoid the non-differentiability inherent to text-generating networks by approximating the word vector corresponding to each generated token via a top-k softmax. We show that a weighted average of the word vectors of the most probable tokens, weighted by the probabilities resulting from the top-k softmax, is a good approximation of the word vector of the generated token. Finally, we demonstrate through a human evaluation process that training a neural dialogue system via adversarial learning with this method successfully discourages it from producing generic responses. Instead, it tends to produce more informative and varied ones. This work has been partially funded by the Basque Government under grant PRE_2017_1_0357, by the University of the Basque Country UPV/EHU under grant PIF17/310, and by the H2020 RIA EMPATHIC (Grant N: 769872)
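
    The top-k softmax approximation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy logits and embeddings, and the choice to renormalize the softmax over the k selected tokens only are all assumptions. In an actual model the same computation would be written with differentiable tensor operations so that gradients can flow from the discriminator back to the generator.

```python
import math


def top_k_soft_embedding(logits, embeddings, k):
    """Approximate the word vector of a generated token.

    Instead of picking one token (a non-differentiable argmax or sample),
    return the probability-weighted average of the word vectors of the
    k most probable tokens. Hypothetical helper, for illustration only.
    """
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the top-k logits (assumption: renormalized
    # over the selected tokens; max subtracted for numerical stability).
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Weighted average of the corresponding word vectors.
    dim = len(embeddings[0])
    return [sum(p * embeddings[i][d] for p, i in zip(probs, top))
            for d in range(dim)]


# Toy vocabulary of three tokens with 2-dimensional word vectors.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
approx = top_k_soft_embedding([0.0, 0.0, -100.0], toy_embeddings, k=2)
```

    With k = 1 the result reduces to the word vector of the single most probable token; larger k trades fidelity to that token for a smoother, differentiable output.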

    A multilingual neural coaching model with enhanced long-term dialogue structure

    In this work we develop a fully data-driven conversational agent capable of carrying out motivational coaching sessions in Spanish, French, Norwegian, and English. Unlike the majority of coaching and, more generally, well-being related conversational agents found in the literature, ours is not designed by handcrafted rules. Instead, we directly model the coaching strategy of professionals with end users. To this end, we gather a set of virtual coaching sessions through a Wizard of Oz platform and apply state-of-the-art Natural Language Processing techniques. We employ a transfer learning approach, pretraining GPT2 neural language models and fine-tuning them on our corpus. However, since these models only take a local dialogue history as input, a simple fine-tuning procedure is not capable of modeling the long-term dialogue strategies that appear in coaching sessions. To alleviate this issue, we first propose to learn dialogue phase and scenario embeddings in the fine-tuning stage. These indicate to the model which part of the dialogue it is in and which kind of coaching session it is carrying out. Second, we develop a global deep learning system which controls the long-term structure of the dialogue. We also show that this global module can be used to visualize and interpret the decisions taken by the conversational agent, and that the learnt representations are comparable to dialogue acts. Automatic and human evaluation show that our proposals improve the baseline models. Finally, interaction experiments with coaching experts indicate that the system is usable and gives rise to positive emotions in Spanish, French and English, while the results in Norwegian indicate that there is still work to be done on fully data-driven approaches for very low-resource languages. This work has been partially funded by the Basque Government under grant PRE_2017_1_0357 and by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 769872
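
    One way to picture the dialogue phase and scenario embeddings mentioned in this abstract is sketched below. The abstract does not say how these embeddings are combined with the token representations; the sketch assumes they are simply added to every token embedding, in the same way GPT-2 adds positional embeddings. The function name and the toy lookup tables are hypothetical, and in a real model the tables would be learnt parameters.

```python
def embed_input(token_ids, phase_id, scenario_id, tok_emb, phase_emb, scen_emb):
    """Return one vector per token: token + phase + scenario embedding.

    Assumption (not confirmed by the abstract): the three embeddings
    are combined by element-wise addition.
    """
    p = phase_emb[phase_id]
    s = scen_emb[scenario_id]
    return [[t + a + b for t, a, b in zip(tok_emb[i], p, s)]
            for i in token_ids]


# Toy lookup tables with 2-dimensional embeddings (illustrative values).
tok_emb = [[1.0, 0.0], [0.0, 1.0]]        # two tokens
phase_emb = [[0.5, 0.5], [0.25, 0.0]]     # two dialogue phases
scen_emb = [[0.0, 2.0]]                   # one coaching scenario

vectors = embed_input([0, 1], phase_id=1, scenario_id=0,
                      tok_emb=tok_emb, phase_emb=phase_emb, scen_emb=scen_emb)
```

    Because the phase and scenario vectors are shared by all tokens of a turn, they act as a global signal about where in the session the model is, which the local dialogue history alone does not provide.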

    Audio Embeddings help to learn better dialogue policies

    Presented at ASRU 2021, Cartagena (Colombia), December 13-17, 2021. Neural transformer architectures have gained a lot of interest for text-based dialogue management in the last few years. They have shown high learning capabilities for open domain dialogue with huge amounts of data and also for domain adaptation in task-oriented setups. But the potential benefits of exploiting the users’ audio signal have rarely been explored in such frameworks. In this work, we combine text dialogue history representations generated by a GPT-2 model with audio embeddings obtained by the recently released Wav2Vec2 transformer model. We jointly fine-tune these models to learn dialogue policies via supervised learning and two policy gradient-based reinforcement learning algorithms. Our experimental results, using the DSTC2 dataset and a simulated user model capable of sampling audio turns, reveal that using audio embeddings leads to higher overall task success than not using them, with statistically significant results across evaluation metrics and training algorithms

    Corrective Focus Detection in Italian Speech Using Neural Networks

    The corrective focus is a particular kind of prosodic prominence with which the speaker intends to correct or emphasize a concept. This work develops an Artificial Cognitive System (ACS) based on Recurrent Neural Networks that analyzes suitable features of the audio channel in order to automatically identify the corrective focus in speech signals. Two different approaches to build the ACS have been developed. The first one addresses the detection of focused syllables within a given Intonational Unit (IU), whereas the second one identifies a whole IU as focused or not. The experimental evaluation over an Italian corpus has shown the ability of the ACS to identify the focus in the speaker's IUs. This ability can lead to further important improvements in human-machine communication. The addressed problem is a good example of synergies between humans and Artificial Cognitive Systems. The research leading to the results in this paper has been conducted in the project EMPATHIC (Grant N: 769872), which received funding from the European Union’s Horizon 2020 research and innovation programme. Additionally, this work has been partially funded by the Spanish Minister of Science under grants TIN2014-54288-C4-4-R and TIN2017-85854-C4-3-R, by the Basque Government under grant PRE_2017_1_0357, and by the University of the Basque Country UPV/EHU under grant PIF17/310

    Dialogue Management and Language Generation for a Robust Conversational Virtual Coach: Validation and User Study

    Designing human–machine interactive systems requires cooperation between different disciplines. In this work, we present a Dialogue Manager and a Language Generator that are the core modules of a voice-based Spoken Dialogue System (SDS) capable of carrying out challenging, long and complex coaching conversations. We also develop an efficient integration procedure for the whole system, which acts as an intelligent and robust Virtual Coach. The coaching task differs significantly from the classical applications of SDSs, resulting in a much higher degree of complexity and difficulty. The Virtual Coach has been successfully tested and validated in a user study with independent elderly people in three countries with three different languages and cultures: Spain, France and Norway. The research presented in this paper has been conducted as part of the project EMPATHIC, which has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant No. 769872. Additionally, this work has been partially funded by projects BEWORD and AMIC-PC of the Minister of Science of Technology, under Grant Nos. PID2021-126061OB-C42 and PDC2021-120846-C43, respectively. Vázquez and López Zorrilla received a PhD scholarship from the Basque Government, with Grant Nos. PRE_2020_1_0274 and PRE_2017_1_0357, respectively

    Can Spontaneous Emotions be Detected from Speech on TV Political Debates?

    Accepted paper. Decoding emotional states from multimodal signals is an increasingly active domain within the framework of affective computing, which aims at a better understanding of Human-Human Communication as well as at improving Human-Computer Interaction. But the automatic recognition of spontaneous emotions from speech is a very complex task, due to the lack of certainty about the speaker's state as well as the difficulty of identifying a variety of emotions in real scenarios. In this work we explore the extent to which emotional states can be decoded from speech signals extracted from TV political debates. The labelling procedure was supported by perception experiments in which only a small set of emotions was identified. In addition, scaled judgements of valence, arousal and dominance were also provided. In this framework the paper shows meaningful comparisons between the dimensional and the categorical models of emotions, which is a new contribution when dealing with spontaneous emotions. To this end, Support Vector Machines (SVM) as well as Feedforward Neural Networks (FNN) have been proposed to develop classifiers and predictors. The experimental evaluation over a Spanish corpus has shown that both models can be identified in speech segments by the proposed artificial systems. This work has been partially funded by the Spanish Government under grant TIN2017-85854-C4-3-R (AEI/FEDER, UE) and conducted in the project EMPATHIC (Grant No. 769872) funded by the European Union’s H2020 research and innovation program

    Detection of fraudulent behaviour in crowdsourced annotation (Iruzurrezko portaeren detekzioa crowd motako etiketazioan)

    This work aims at detecting low-quality labels in crowdsourcing annotation tasks. We validate our proposal by carrying out experiments on a difficult and subjective task: emotion recognition. We have developed several measures to detect fraudulent behaviour, including measures related to the labelling time, the inter-worker agreement and the distribution of the answers. Not only do we show that each of the described measures is helpful, but we also demonstrate that combining them increases the probability of detecting fraudulent workers. The authors would like to thank the University of the Basque Country, the Spanish Government (grant TIN2017-85854-C4-3-R) and the European Commission H2020 SC1-PM15 programme (RIA call 7, grant 769872) for supporting this research
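
    The three kinds of measures named in this abstract (labelling time, agreement between workers, and distribution of answers) could be combined along the following lines. This is only a sketch under stated assumptions: the concrete statistics (median time, agreement with a majority label, share of the most frequent answer), the thresholds, and the simple count used to combine them are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from statistics import median


def worker_measures(answers, times, majority):
    """Compute three per-worker measures (hypothetical definitions).

    answers:  list of (item_id, label) given by one worker
    times:    labelling time in seconds for each answer
    majority: dict mapping item_id to the majority label of all workers
    """
    n = len(answers)
    agreement = sum(1 for item, lab in answers if majority[item] == lab) / n
    top_share = Counter(lab for _, lab in answers).most_common(1)[0][1] / n
    return {"median_time": median(times),
            "agreement": agreement,
            "top_share": top_share}


def suspicion_score(m, min_time=2.0, min_agreement=0.5, max_top_share=0.9):
    """Count how many measures look suspicious (thresholds are assumptions)."""
    return int(m["median_time"] < min_time) \
        + int(m["agreement"] < min_agreement) \
        + int(m["top_share"] > max_top_share)


majority = {"a": "happy", "b": "sad", "c": "angry", "d": "calm"}

# A worker who answers "happy" to everything in one second each.
spammer = worker_measures(
    [("a", "happy"), ("b", "happy"), ("c", "happy"), ("d", "happy")],
    [1, 1, 1, 1], majority)

# A worker who takes time and agrees with the majority.
careful = worker_measures(
    [("a", "happy"), ("b", "sad"), ("c", "angry"), ("d", "calm")],
    [5, 6, 4, 7], majority)

spammer_score = suspicion_score(spammer)
careful_score = suspicion_score(careful)
```

    A worker flagged by several measures at once is a stronger candidate for review than one flagged by a single measure, which is the intuition behind combining them.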

    Self-taught machines that learn to speak Basque (Euskaraz hitz egiten ikasten duten makina autodidaktak)

    In this work we present an automatic dialogue system that learns to speak Basque by means of neural networks. To that end, we use generative adversarial networks, which implement the idea of the Turing test computationally. We show that such networks can be trained with a Basque corpus two orders of magnitude smaller than the English corpora normally used. Finally, we show that it is advisable to use preprocessing that takes Basque morphology into account. To the best of our knowledge, this is the first Basque dialogue system based on neural networks. The authors would like to thank the Basque Government, the University of the Basque Country and the European Commission for supporting this research under grants PRE_2017_1_0357 and PIF17/310, and grant 769872 of the H2020 SC1-PM15 programme (RIA call 7), respectively

    A Spanish Corpus for Talking to the Elderly

    Paper presented at the 11th International Workshop on Spoken Dialogue Systems, IWSDS 2020; Madrid; Spain; 21 September 2020 through 23 September 2020. This work presents a Spanish corpus developed within the framework of the EMPATHIC project (http://www.empathic-project.eu/). It was designed for building a dialogue system capable of talking to elderly people and promoting healthy habits through a coaching model. The corpus, which comprises audio, video and text channels, was acquired using a Wizard of Oz strategy. It was annotated with different labels according to the different models that are needed in a dialogue system, including an emotion-based annotation that will be used to generate empathetic system reactions. The annotation at the different levels, along with the procedure employed, is described and analysed